Dirichlet draws are sparse with high probability
Author
Abstract
This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse, i.e., most coordinates are small.

1 Bounds

Let \mathrm{Dir}(\alpha) denote a Dirichlet distribution with all parameters equal to \alpha.

Theorem 1.1. Suppose n \ge 2 and (X_1, \ldots, X_n) \sim \mathrm{Dir}(1/n). Then, for any c_0 \ge 1 satisfying 6 c_0 \ln(n) + 1 < 3n,

    \Pr\left[ \left|\left\{ i : X_i \ge \frac{1}{n^{c_0}} \right\}\right| \le 6 c_0 \ln(n) \right] \ge 1 - \frac{1}{n^{c_0}}.

The parameter is taken to be 1/n, which is standard in machine learning. The theorem states that, with high probability, as the exponent on the sparsity threshold grows linearly (n^{-1}, n^{-2}, n^{-3}, \ldots), the number of coordinates above the threshold cannot grow faster than linearly (6\ln(n), 12\ln(n), 18\ln(n), \ldots).

The statement can be parameterized slightly more finely, exposing more tradeoffs than just the threshold and the number of coordinates.

Theorem 1.2. Suppose n \ge 1 and c_1, c_2, c_3 > 0 with c_2 \ln(n) + 1 < 3n, and (X_1, \ldots, X_n) \sim \mathrm{Dir}(c_1/n); then

    \Pr\left[ \left|\left\{ i : X_i \ge n^{-c_3} \right\}\right| \le c_2 \ln(n) \right] \ge 1 - \frac{1}{e^{1/3}} \left(\frac{1}{n}\right)^{c_2/3 - c_1 c_3} - \frac{1}{e^{4/9}} \left(\frac{1}{n}\right)^{4 c_2 / 9}.

The natural question is whether the factor \ln(n) is an artifact of the analysis; simulation experiments with Dirichlet parameter \alpha = 1/n, summarized in Figure 1a, exhibit both the \ln(n) term and the linear relationship between the sparsity threshold and the number of coordinates exceeding it.

The techniques here are loose when applied to the case \alpha = o(1/n). In particular, Figure 1b suggests that \alpha = 1/n^2 leads to a single nonsmall coordinate with high probability, which is stronger than what is captured by the following theorem.

Theorem 1.3. Suppose n \ge 3 and (X_1, \ldots, X_n) \sim \mathrm{Dir}(1/n^2); then

    \Pr\left[ \left|\left\{ i : X_i \ge n^{-2} \right\}\right| \le 5 \right] \ge 1 - e^{2/e - 2} - e^{-8/3} \ge 0.64.

Moreover, for any function g : \mathbb{Z}_{++} \to \mathbb{R}_{++} and any n satisfying 1 \le \ln(g(n)) < 3n - 1,

    \Pr\left[ \left|\left\{ i : X_i \ge n^{-2} \right\}\right| \le \ln(g(n)) \right] \ge 1 - e^{2/e - 1/3} \left(\frac{1}{g(n)}\right)^{1/3} - e^{-4/9} \left(\frac{1}{g(n)}\right)^{4/9}.

(Take for instance g to be the inverse Ackermann function.)
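As a consistency check (this instantiation is implicit in the statements above, not quoted from the note), Theorem 1.1 is essentially Theorem 1.2 with c_1 = 1, c_3 = c_0, and c_2 = 6 c_0: the failure probability becomes

    \frac{1}{e^{1/3}} \left(\frac{1}{n}\right)^{2 c_0 - c_0} + \frac{1}{e^{4/9}} \left(\frac{1}{n}\right)^{8 c_0 / 3} = e^{-1/3}\, n^{-c_0} + e^{-4/9}\, n^{-8 c_0 / 3} \le n^{-c_0},

where the last inequality holds for all n \ge 2 and c_0 \ge 1, since e^{-4/9}\, n^{-5 c_0 / 3} \le e^{-4/9}\, 2^{-5/3} \approx 0.20 < 1 - e^{-1/3} \approx 0.28.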
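The following Monte Carlo sketch checks Theorem 1.1 empirically and mimics the kind of sweep summarized in Figure 1a. It is a minimal illustration, not the note's actual experiment: the figure's exact setup is not given here, and the helper log_dirichlet is my own (it samples in log space via the identity Gamma(a) =_d Gamma(a+1) U^{1/a}, since direct gamma draws underflow to zero at tiny shape parameters).

import numpy as np

rng = np.random.default_rng(0)

def log_dirichlet(alpha, n):
    # One draw from Dir(alpha, ..., alpha), returned as log-coordinates.
    # Gamma(a) =d Gamma(a+1) * U^(1/a) keeps tiny shapes from underflowing.
    logg = np.log(rng.gamma(alpha + 1.0, size=n)) + np.log(rng.random(n)) / alpha
    m = logg.max()
    return logg - (m + np.log(np.exp(logg - m).sum()))  # normalized log X_i

n, trials = 1000, 2000
for c0 in (1, 2, 3):                    # thresholds n^-1, n^-2, n^-3
    bound = 6 * c0 * np.log(n)          # Theorem 1.1's count bound
    assert bound + 1 < 3 * n            # the theorem's side condition
    counts = np.array([(log_dirichlet(1.0 / n, n) >= -c0 * np.log(n)).sum()
                       for _ in range(trials)])
    print(f"c0={c0}: mean count {counts.mean():5.1f}, "
          f"P[count > {bound:.1f}] = {(counts > bound).mean():.4f} "
          f"(Theorem 1.1: <= {n ** (-c0):.0e})")

If the theorem and Figure 1a are accurate, the mean count should grow roughly linearly in c_0 and like \ln(n) in n, while the empirical failure rate stays below n^{-c_0}.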
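The same harness, pointed at the \alpha = o(1/n) regime of Theorem 1.3, lets one eyeball the Figure 1b claim that \alpha = 1/n^2 typically yields a single nonsmall coordinate. Again this is a sketch under my assumptions, continuing with the log_dirichlet helper and rng defined above:

n, trials = 1000, 2000
counts = np.array([(log_dirichlet(1.0 / n**2, n) >= -2 * np.log(n)).sum()
                   for _ in range(trials)])
print("P[count <= 5] =", (counts <= 5).mean())   # Theorem 1.3 promises >= 0.64
print("P[count == 1] =", (counts == 1).mean())   # Figure 1b: usually exactly one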
Journal: CoRR
Volume: abs/1301.4917
Pages: -
Published: 2013